Formatting module, system and method for formatting an electronic character sequence

ABSTRACT

There is provided a formatting module configured to format spaces in an electronic character sequence. The formatting module supports at least one language and comprises a language identifier configured to identify whether the electronic character sequence is written in a supported language, and a character identifier configured to identify a particular character or a particular sequence of characters in the electronic character sequence. The formatting module is configured to format spaces in the electronic character sequence on the basis of the language identified and the particular character identified or the particular sequence of characters identified, when a supported language is identified. A system and method for formatting text are also provided.

FIELD OF THE INVENTION

The present invention relates to the formatting of spaces in an electronic character sequence. In particular, it relates to a formatting module, system and method for formatting spaces in an electronic character sequence.

BACKGROUND

Punctuation marks are symbols that indicate the structure and organization of written language, as well as intonation and pauses to be observed when reading aloud. The appearance and usage of punctuation marks vary between languages and scripts but in most cases they are vital to disambiguate the meaning of sentences. The use and interpretation of punctuation marks can be heavily context-dependent. For example, a full stop “.” can be used as sentence-ending punctuation, an abbreviation indicator, a decimal point, and so on. Punctuation is also present in mathematical and scientific formulae.

Some punctuators appear in pairs and one cannot exist without the other, for example the left parenthesis ‘(’ and the right parenthesis ‘)’. However, in some scenarios a single character is used to represent two punctuators, creating ambiguity, for example in the case of the single quote mark: '.

A space is a blank area, often used to separate words, letters, numbers, and punctuation. Conventions for the formatting of spaces vary among languages. For example, the correct formatting of spaces around a question mark “?” in English is “word?”, with no space between the word and the question mark, and a space following the question mark. However, in French the convention is “word ?”, where a space is inserted on either side of the question mark.

A number of current-market text input systems exhibit some form of space formatting. For example, when a user enters one of the following characters [ ? ! : ; , . ] after entering a space, the Android default keyboard formats spaces either side of the punctuation mark by removing the leading space and adding a trailing space, irrespective of the language in which the text is being entered.

It is an object of the present invention to provide a means for automatically formatting the spaces in an electronic character sequence, such that a user can concentrate on the content of a message without worrying about whether the spaces are correctly formatted in the electronic character sequence. It is also an object of the invention to provide a means for correctly formatting spaces in an electronic character sequence on the basis of the conventions of the language in which the electronic character sequence is written.

SUMMARY OF THE INVENTION

In a first aspect of the present invention, there is provided a formatting module supporting at least one language and configured to format spaces in an electronic character sequence written in a supported language, the formatting module comprising:

-   a language identifier configured to identify whether the electronic character sequence is written in a supported language;
-   a character identifier configured to identify a particular character or a particular sequence of characters in the electronic character sequence;
-   wherein the formatting module is configured to format spaces in the electronic character sequence on the basis of the language identified and the particular character or sequence of characters identified, when a supported language is identified.

Preferably, formatting spaces in the electronic character sequence comprises inserting and/or deleting spaces in the electronic character sequence.

In a preferred embodiment, the character identifier comprises:

-   at least one set of rules, each rule relating to a particular character or sequence of characters to be identified in the electronic character sequence; and
-   a comparison mechanism configured to compare each rule of one of the at least one set of rules to the electronic character sequence to identify whether a rule is applicable;
-   wherein each rule is associated with one or more actions which describe the format of spaces to be applied by the formatting module to the electronic character sequence given a supported language and the particular character or sequence of characters; and
-   wherein the formatting module is configured to format spaces in the electronic character sequence by applying the one or more actions associated with the applicable rule to the electronic character sequence.

The comparison mechanism is preferably configured to compare each rule of one of the at least one set of rules to the electronic character sequence only when a supported language is identified.

Preferably, the formatting module supports a plurality of languages and the language identifier is configured further to identify the most likely language of the supported languages that the electronic character sequence is written in.

The character identifier may be configured to identify a punctuation mark and the formatting module may be configured to format the spaces either side of the punctuation mark on the basis of the punctuation mark.

The character identifier may be configured to identify a particular context in the electronic character sequence and the formatting module may be configured to format the spaces in the electronic character sequence on the basis of the context.

The character identifier may be configured to identify a punctuation mark in the electronic character sequence, and the formatting module may be configured to format the spaces either side of the punctuation mark on the basis of the category of punctuation mark.

The one or more actions may comprise a sequence of actions, wherein when a rule is found to be applicable, the comparison mechanism is configured to apply the sequence of actions to the electronic character sequence.

When the formatting module is configured to support a plurality of languages, the character identifier preferably comprises a plurality of sets of rules, one set of rules for each language that is supported, where the comparison mechanism is configured to compare each rule of the set of rules that corresponds to the most likely language to the electronic character sequence.

The formatting module may comprise sets of rules relating to each language, each family of languages, and all languages in the world, wherein the rules are applied in a hierarchical structure such that, once a supported language has been identified, the comparison mechanism first compares each rule from the set of rules specific to that language, followed by each rule from the set of rules applicable to the family of languages to which that language belongs, followed by each rule of the set of rules which are applicable to all languages, until an applicable rule is identified or no applicable rule is identified and all rules are exhausted.
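The hierarchical lookup described above can be sketched as follows. This is a minimal illustration, not the module's actual implementation: the rule names, the language-to-family mapping and the predicate form of each rule are all assumptions made for the example.

```python
from typing import Callable, Optional

# A rule here is just (name, predicate); both are illustrative stand-ins.
Rule = tuple[str, Callable[[str], bool]]

# Hypothetical rule sets, ordered from most specific to least specific.
LANGUAGE_RULES: dict[str, list[Rule]] = {
    "fr": [("fr-question-mark", lambda text: text.endswith("?"))],
}
FAMILY_RULES: dict[str, list[Rule]] = {
    "romance": [("romance-colon", lambda text: text.endswith(":"))],
}
GLOBAL_RULES: list[Rule] = [("global-full-stop", lambda text: text.endswith("."))]

# Hypothetical mapping from a language to its family.
FAMILY_OF = {"fr": "romance", "es": "romance"}


def find_applicable_rule(text: str, language: str) -> Optional[str]:
    """Return the name of the first applicable rule, or None if all are exhausted."""
    family = FAMILY_OF.get(language, "")
    tiers = (
        LANGUAGE_RULES.get(language, []),  # 1. rules specific to the language
        FAMILY_RULES.get(family, []),      # 2. rules for the language family
        GLOBAL_RULES,                      # 3. rules applicable to all languages
    )
    for tier in tiers:
        for name, applies in tier:
            if applies(text):
                return name
    return None
```

The search stops at the first applicable rule, so a language-specific rule shadows any family-level or global rule that would also match the same text.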

The comparison mechanism is preferably configured to compare the rules in a specific predetermined order. The set of rules preferably comprises context rules, character rules and category rules and the comparison mechanism is preferably configured to compare the rules in the following order until an applicable rule is identified or no applicable rule is identified and all rules are exhausted: context rules, character rules, and then category rules.
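As a rough illustration of this ordering, the sketch below checks one context rule, one character rule and one category rule in that order. The three conditions (a "www" context, a full stop, and any Unicode punctuation category) are assumptions chosen for the example, and the input is assumed to be non-empty.

```python
import re
import unicodedata
from typing import Optional


def first_applicable(text: str) -> Optional[str]:
    """Check a context rule, then a character rule, then a category rule."""
    last = text[-1]  # assumes text is non-empty
    checks = [
        # Context rule: the trailing token starts with "www.".
        ("context", lambda: re.search(r"www\.\S*$", text) is not None),
        # Character rule: the last character is exactly a full stop.
        ("character", lambda: last == "."),
        # Category rule: the last character is any Unicode punctuation (P*).
        ("category", lambda: unicodedata.category(last).startswith("P")),
    ]
    for kind, applies in checks:
        if applies():
            return kind
    return None
```

With this ordering, the full stop in "www.site." is claimed by the context rule before the more general character rule is ever consulted.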

In a second aspect of the invention there is provided a formatting module supporting at least one language and configured to format spaces in an electronic character sequence, the formatting module comprising:

-   a punctuation mark identifier configured to identify a punctuation mark in the electronic character sequence;
-   wherein the formatting module is configured to format spaces in the electronic character sequence on the basis of the language in which the electronic character sequence is written, the punctuation mark identified, and a context of the punctuation mark, when a supported language is identified.

In a third aspect of the invention there is provided a system for inputting text into an electronic device comprising:

-   a text prediction engine configured to receive an electronic character sequence as input and configured to generate and output a corrected electronic character sequence; and
-   a formatting module as described above, wherein the formatting module is configured to receive the corrected electronic character sequence as input, and to generate a formatted character sequence by formatting spaces in the corrected electronic character sequence when a supported language is identified.

In a fourth aspect of the invention there is provided a system for inputting text into an electronic device comprising:

-   a text prediction engine configured to receive an electronic character sequence as input, the text prediction engine comprising:
    -   a language identifier configured to identify which language the electronic character sequence is most likely written in, and to correct the electronic character sequence on the basis of the identified language;
    -   wherein the text prediction engine is configured to generate and output a corrected electronic character sequence and to output the language identified;
-   a formatting module supporting at least one language and configured to receive the language identified and the corrected electronic character sequence, and configured to format spaces in the electronic character sequence when the identified language is supported, the formatting module comprising:
    -   a character identifier configured to identify a particular character or a particular sequence of characters in the electronic character sequence;
    -   wherein the formatting module is configured to format spaces in the electronic character sequence on the basis of the language identified and the particular character or the particular sequence of characters identified.

In a fifth aspect of the invention there is provided a method of formatting, with a formatting module supporting at least one language and having a character identifier, spaces in an electronic character sequence, the method comprising:

-   identifying whether the electronic character sequence is written in a language supported by the formatting module;
-   identifying, with the character identifier, a particular character or a particular sequence of characters in the electronic character sequence;
-   formatting, with the formatting module, spaces in the electronic character sequence on the basis of the language identified and the particular character or sequence of characters identified, when a supported language is identified.

The formatting module may comprise a language identifier to identify whether the electronic character sequence is written in a language supported by the formatting module. Preferably, the formatting module supports a plurality of languages and the method further comprises identifying with the language identifier the most likely language of the electronic character sequence.

The most likely language of the electronic character sequence may be identified by a text prediction engine, where the method further comprises transmitting the most likely language to the formatting module which identifies whether the most likely language is supported by the formatting module.

The character identifier preferably comprises at least one set of rules and a comparison mechanism, each rule defining the formatting of spaces in the electronic character sequence, wherein the method further comprises:

-   comparing, with the comparison mechanism, each rule of one of the at least one set of rules to the electronic character sequence to identify whether a rule is applicable to the character sequence;
-   identifying, with the comparison mechanism, that a particular rule is applicable to the character sequence; and
-   applying the applicable rule to the electronic character sequence to format the spaces in the electronic character sequence.

Preferably, the comparison mechanism compares each rule of one of the at least one set of rules to the electronic character sequence only when a supported language is identified.

Each rule may relate to a particular character or sequence of characters to be identified and each rule is associated with one or more actions which describe the format of spaces to be applied by the formatting module to the electronic character sequence given a supported language and the particular character or sequence of characters. In the method, the step of applying the applicable rule preferably comprises applying the one or more actions associated with that applicable rule to the electronic character sequence.

Identifying a particular character may comprise identifying a punctuation mark and formatting the spaces in the electronic character sequence may comprise formatting the spaces either side of the punctuation mark on the basis of the form of the punctuation mark.

Identifying a particular sequence of characters may comprise identifying a particular context and formatting the spaces in the electronic character sequence may comprise formatting the spaces on the basis of the context.

Identifying a particular character may comprise identifying a punctuation mark and formatting the spaces in the electronic character sequence may comprise formatting the spaces either side of the punctuation mark on the basis of the category of punctuation mark.

Where each rule is associated with one or more actions, the one or more actions may comprise a sequence of actions, wherein the sequence of actions is applied sequentially to the electronic character sequence.

Where the formatting module supports a plurality of languages, the character identifier may comprise a plurality of sets of rules, one set of rules for each language supported, and comparing each rule to the electronic character sequence comprises comparing each rule of the set of rules that corresponds to the most likely language.

The formatting module may comprise sets of rules relating to each supported language, each family of languages, and all languages in the world, and the method comprises applying the rules in a hierarchical structure such that, once a language has been identified, the comparison mechanism first compares each rule from the set of rules specific to that language, followed by each rule from the set of rules applicable to the family of languages to which that language belongs, followed by each rule of the set of rules which are applicable to all languages, until an applicable rule is identified or no applicable rule is identified and all rules are exhausted.

The comparison mechanism preferably compares the rules in a specific predetermined order.

The set of rules may comprise context rules, character rules and category rules, and the method preferably comprises comparing the rules in the following order until an applicable rule is identified or no applicable rule is identified and all rules are exhausted: context rules, character rules, and then category rules.

In a sixth aspect of the invention there is provided a computer program product comprising a computer readable medium having stored thereon computer program means for causing a processor to carry out a method as described above.

BRIEF DESCRIPTION OF THE DRAWINGS

The present invention will now be described in detail with reference to the accompanying drawings, in which:

FIG. 1 is a schematic of a system comprising a prediction engine and a formatting module in accordance with the present invention;

FIG. 2 is a schematic of a formatting module in accordance with the present invention;

FIG. 3 is a schematic of the formatting module of FIG. 2 shown in greater detail;

FIG. 4 is an illustration of a structure of specific types of rules within a set of rules for a given language, and shows the order in which a comparison mechanism compares the rules, in accordance with the present invention;

FIG. 5 is an illustration of how the rules are structured for the English language and the order in which the comparison mechanism compares the rules, in accordance with the present invention.

DETAILED DESCRIPTION

The present invention provides a formatting module that is configured to format the spaces for a particular sentence on the basis of the conventions for the language in which the sentence is written. The formatting module formats the spaces by inserting and/or deleting spaces in the electronic character sequence.

Preferably, but not necessarily, the formatting module 10 is part of a system, such as an electronic device 100, comprising a text prediction engine 30, as shown in FIG. 1. The electronic device is preferably a mobile device, such as a PDA, tablet, laptop computer or mobile phone. The formatting module may be used to format the spaces in an electronic character sequence entered by a user for a text message. The user interacts with a text entry system 50 of the electronic device 100 by entering text via an input mechanism such as a virtual keyboard. In the particular case of a predictive text entry system, the text prediction engine 30 may be configured to correct mistyped or misspelt words and may also be configured to predict what the user is going to write next, thus improving the performance and quality of the text input into the device. An example of such a text prediction engine 30 is described in PCT/GB2011/001419, which is hereby incorporated by reference in its entirety.

As can be seen from FIG. 1, a character sequence is input into the device 100. The character sequence is passed to a text prediction engine 30 which may modify that character sequence to correct misspelt words and/or to predict words. The character sequence, so modified by the text prediction engine 30, is passed to the formatting module 10. The formatting module 10 is configured to output a space-formatted version of the modified character sequence, as shown in FIGS. 1 and 2. The formatting module formats the spaces of a character sequence by inserting and/or deleting spaces in the sequence. The formatting module 10 formats the spaces for an electronic character sequence only if the language in which that character sequence is written is supported by the formatting module 10.

A formatting module 10 in accordance with the present invention is shown in FIG. 2. The formatting module 10 is configured to support at least one language. The formatting module 10 comprises a language identifier 20 configured to identify whether an electronic character sequence is written in a language supported by the formatting module 10. The language identifier 20 makes use of one or more statistical language models, the general properties of which are known in the art, in order to identify whether the electronic character sequence is written in a language supported by the formatting module 10.

In a preferred embodiment, the formatting module 10 supports a plurality of languages. Thus, in the preferred embodiment, the language identifier 20 comprises a plurality of statistical language models, each statistical language model corresponding to a different language supported by the formatting module 10, and the language identifier 20 is configured further to identify the most likely supported language of the electronic character sequence. At any given stage, the formatting module 10 is configured to maintain a list of “active languages”, each of which is associated with a language model.

One process for identifying the most likely current language is to maximize the probability of a language, given a context, i.e. maximizing P(language|context), according to the following expression (using Bayes' rule):

$P(\text{language} \mid \text{context}) = \frac{P(\text{context} \mid \text{language})\,P(\text{language})}{P(\text{context})}$

As the absolute values of P(language|context) are not important, since only the ranking of languages matters, the term P(context), which does not depend on language, may be dropped from the expression. Additionally, a uniform prior over languages, P(language)=k, may also be dropped since it is constant with respect to language. With these assumptions, the only quantity that the language identifier is required to estimate is P(context|language). Typically context is just a sequence of words, therefore to estimate P(context|language), the language identifier preferably uses a ‘chain’ of conditional probability estimates, making a ‘Markovian’ conditional independence assumption:

$P(\text{context} \mid \text{language}) \approx \prod_{i=0}^{N_{\text{words}}} P(\text{word}_{i} \mid \text{word}_{i-1} \ldots \text{word}_{i-N+1}, \text{language})$

Each language is therefore separately modelled by a smoothed n-gram language model (known in the art and as described in WO 2012/042217), capable of estimating the probability of a word, given local context.
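To make the ranking concrete, the sketch below scores a context under one model per language and picks the highest-scoring language. It is only an illustration under stated assumptions: an add-one smoothed bigram model stands in for the smoothed n-gram model, the tiny training corpora are invented, and the class and function names are hypothetical.

```python
import math
from collections import Counter


class BigramModel:
    """Add-one smoothed bigram model; a simple stand-in for a smoothed n-gram model."""

    def __init__(self, corpus: list[list[str]]):
        self.context_counts = Counter()  # how often each word precedes another
        self.bigram_counts = Counter()   # how often each (previous, word) pair occurs
        vocabulary = set()
        for sentence in corpus:
            tokens = ["<s>"] + sentence  # "<s>" marks the start of a sentence
            vocabulary.update(sentence)
            self.context_counts.update(tokens[:-1])
            self.bigram_counts.update(zip(tokens, tokens[1:]))
        self.vocab_size = len(vocabulary) + 1  # +1 reserves mass for unseen words

    def log_likelihood(self, context: list[str]) -> float:
        """Sum of log P(word_i | word_{i-1}) over the context."""
        tokens = ["<s>"] + context
        total = 0.0
        for previous, word in zip(tokens, tokens[1:]):
            numerator = self.bigram_counts[(previous, word)] + 1
            denominator = self.context_counts[previous] + self.vocab_size
            total += math.log(numerator / denominator)
        return total


def most_likely_language(context: list[str], models: dict[str, BigramModel]) -> str:
    """Rank languages by P(context | language), assuming a uniform prior."""
    return max(models, key=lambda language: models[language].log_likelihood(context))
```

Working in log space avoids numerical underflow when the product runs over many words, and it leaves the ranking of languages unchanged.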

There are other ways of estimating P(context|language), using different types of language models, e.g. those that include syntactic and/or semantic information. Another possibility would be to use a Hidden Markov Model (HMM) to estimate a progression of unobserved language “states”. A further possibility would be to use a supervised discriminative classification model to predict language, e.g. a support vector machine (SVM) or neural network.

To transform the incoming sequence of characters into a sequence of terms the language identifier 20 uses a tokenizer as is known in the art.

In a system such as that illustrated in FIG. 1, the prediction engine 30 may comprise a language identifier, rather than it being provided in the formatting module 10. As described above, the language identifier will comprise a tokenizer and a plurality of language models, which may already be present in the prediction engine, such as the prediction engine described in WO 2012/042217, which is hereby incorporated by reference in its entirety.

To estimate the most likely language given context, the language identifier 20 is configured to calculate the likelihood of the context in each supported language in turn, and to select the language with the maximum likelihood. The likelihood of the context (a sequence of terms) is the product of the probability of each term, given preceding terms, which is computed by a smoothed n-gram model, as has been described in relation to a text prediction engine in WO 2012/042217.

If the user switches languages whilst typing, the formatting of the spaces around the punctuation marks may differ between the sentences, dependent on the language in which each sentence is written, e.g. “Bonjour mon ami ! How are you doing? Talk to you soon.”

To provide a formatting module 10 that is capable of identifying a change in language, for example where a user has switched languages between sentences, the language identifier 20 is preferably configured to limit the amount of context used to make the estimate of the most likely language. This provides a basic form of recency in the model for identifying the most likely language: languages used more recently are intuitively more likely than languages used much earlier in a document. For instance, in one embodiment, the language identifier 20 may use the six most recent words of context. However, the number of most recent words of context could be chosen dependent on the frequency at which a user switches between languages and the length of their input stream in any given language.
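A bounded context of this kind might be kept as follows. This is a minimal sketch; the class name `RecentContext` is hypothetical, and the default of six words simply mirrors the embodiment mentioned above.

```python
from collections import deque


class RecentContext:
    """Keeps only the most recent words for use in language identification."""

    def __init__(self, max_words: int = 6):
        # A deque with maxlen silently discards the oldest word when full.
        self._words = deque(maxlen=max_words)

    def add(self, word: str) -> None:
        self._words.append(word)

    def words(self) -> list[str]:
        return list(self._words)
```

A smaller window reacts faster to a language switch but gives the language models less evidence to work with, which is the trade-off the paragraph above describes.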

The language identifier 20 is preferably configured to identify whether the language in which the electronic character sequence is written is supported by the formatting module 10. By way of a non-limiting example, the language identifier 20 may identify that the electronic character sequence is written in an unsupported language if none of the context terms of the sequence are present in one of the language identifier's language models, where each language model corresponds to a supported language. Thus, if one or more of the context terms are determined to be present in one of the language models, the language identifier determines that the electronic character sequence is written in a supported language. A variation on this example is one in which the language identifier 20 is configured to identify whether a certain fraction or ratio of the context words are present in a language model, e.g. a quarter, two-thirds or any other fraction or ratio of the context terms are present in one of the language models, in order to determine that the electronic character sequence is written in a supported language. Any other suitable method for determining whether the language of the electronic character sequence is supported can be used.
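Both variants of this check can be sketched with a single function. The sketch assumes each language model exposes its vocabulary as a set of terms; the function name and the `min_fraction` parameter are hypothetical.

```python
def is_supported(
    context_terms: list[str],
    vocabularies: dict[str, set[str]],
    min_fraction: float = 0.0,
) -> bool:
    """Return True if some supported language knows enough of the context terms.

    With the default min_fraction of 0.0, a single known term is sufficient.
    """
    if not context_terms:
        return False
    for vocabulary in vocabularies.values():
        known = sum(1 for term in context_terms if term in vocabulary)
        fraction = known / len(context_terms)
        if known > 0 and fraction >= min_fraction:
            return True
    return False
```

Raising `min_fraction` makes the check stricter, which may help when supported and unsupported languages share a few common words.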

As shown in FIG. 3, the character identifier 40 preferably comprises a set of rules 70, each rule relating to a character or particular sequence of characters to be identified, and a comparison mechanism 60 configured to compare each rule of the set of rules 70 to the electronic character sequence to determine whether a rule is applicable. If the rule is applicable, then a character or particular sequence of characters is identified, e.g. if the rule relates to a particular punctuation mark and the rule is found to be applicable, it is because that punctuation mark is within the electronic character sequence. The electronic character sequence is preferably passed to the formatting module 10 sequentially, e.g. a character at a time, with the comparison mechanism 60 comparing each rule to the last character or last sequence of characters received.

Thus, the character identifier 40 uses the rules to identify when a particular character or sequence of characters, such as a punctuation mark, occurs in the electronic character sequence. Furthermore, the rules define, by one or more actions associated with the rule, the space formatting to apply to the electronic character sequence, i.e. whether spaces should be inserted and/or deleted. Thus, once a rule has been found to be applicable to a particular character or sequence of characters, the actions associated with that rule are applied to the electronic character sequence to format the spaces within the electronic character sequence, e.g. in the case of the particular character being a punctuation mark, the actions may define the formatting of the spaces either side of the punctuation mark, as will be described in more detail below.

The set of rules 70 preferably comprises a plurality of sets of rules, a set of rules for each language supported by the formatting module 10. The comparison mechanism 60 is configured to compare the set of rules relating to the language identified by the language identifier 20 as the most likely supported language. In an embodiment in which the language identifier 20 supports a single language, the comparison mechanism 60 comprises a single set of rules 70 corresponding to that language, and the comparison mechanism 60 is configured to compare the set of rules 70 to the electronic character sequence if the language of the character sequence is identified as being the supported language. If the language of the character sequence is not identified as a supported language, the comparison mechanism 60 does not search for applicable rules.

The formatting module 10 is configured such that a system designer is able to manually add new rules, with associated actions, to the formatting module. The rules and associated actions can be updated without affecting the other components of the formatting module.

A rule is preferably defined by a four-tuple, as follows: Rule :: (C, s, A, S)

:: is an operator that can be read “has type of”.

C is a condition taking the form of a regular expression, implementing a function of type F :: [character]→{true, false}, e.g. taking the incoming character sequence and returning a boolean denoting whether or not a rule is applicable and thus whether or not to apply the sequence of actions associated with that rule. The comparison mechanism 60 identifies a particular character or sequence of characters in an electronic character sequence by implementing the function of the type F :: [character]→{true, false}. This field is therefore essential and is never empty.

s represents a state that allows the system to “remember” previous rule applications in some cases. For example, the state may be “None” when the system is not required to maintain a status, or the state may be “Open” or “Close” where punctuators appear in pairs and one cannot exist without the other, e.g. left parenthesis ‘(’ and right parenthesis ‘)’.
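One way such a state could be used is to disambiguate a character that serves as both the opening and the closing punctuator, such as a straight double quote. The sketch below is an assumption about how “Open” and “Close” might be tracked, not a description of the module itself; the function name is hypothetical.

```python
def quote_states(text: str, quote: str = '"') -> list[str]:
    """Label each occurrence of an ambiguous quote mark as "Open" or "Close".

    The remembered state flips on every occurrence, so the first quote opens a
    pair, the second closes it, and so on.
    """
    states = []
    is_open = False
    for char in text:
        if char == quote:
            is_open = not is_open
            states.append("Open" if is_open else "Close")
    return states
```

Knowing whether a quote opens or closes a pair matters for spacing, since an opening quote typically takes a space before it and a closing quote a space after it.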

A is a sequence of Actions, i.e. A :: [Action]. In special cases this could be an empty sequence represented by [ ]. Actions are the means by which the formatting module 10 describes the space formatting that should be applied to, for example, a punctuation mark given a particular character sequence context (e.g. where the punctuation mark is found in the context of a mathematical equation). When a punctuation mark of the electronic character sequence is determined by the comparison mechanism 60 to match one of the rules, each action held by the rule is applied, preferably sequentially, to the punctuation mark to ensure the correct formatting of the spaces either side of the punctuation mark. For example, if the punctuation mark is a full stop, the Action might be to delete the space before the full stop (if such a space is present) and to insert a space after the full stop (if such a space is missing), where the most likely language is English.
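The English full stop example can be sketched as two actions applied in sequence. The function names and the convention of passing the punctuation mark's index are assumptions for illustration; this sketch also only inserts a trailing space when another character already follows the mark.

```python
from typing import Callable

# Each action takes the text and the index of the punctuation mark, and returns
# the reformatted text together with the mark's (possibly shifted) index.
Action = Callable[[str, int], tuple[str, int]]


def delete_space_before(text: str, index: int) -> tuple[str, int]:
    """Remove any spaces immediately before text[index]."""
    start = index
    while start > 0 and text[start - 1] == " ":
        start -= 1
    return text[:start] + text[index:], start


def insert_space_after(text: str, index: int) -> tuple[str, int]:
    """Insert one space after text[index] if a non-space character follows it."""
    if index + 1 < len(text) and text[index + 1] != " ":
        text = text[: index + 1] + " " + text[index + 1 :]
    return text, index


def apply_actions(text: str, index: int, actions: list[Action]) -> str:
    """Apply each action in order, threading the updated text and index through."""
    for action in actions:
        text, index = action(text, index)
    return text
```

For instance, `apply_actions("Hello .World", 6, [delete_space_before, insert_space_after])` yields `"Hello. World"`. Returning the updated index from each action lets later actions in the sequence still locate the punctuation mark after earlier ones have changed the text length.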

There are two types of actions that the formatting module may comprise: type A and type B.

An action of type A is a function that operates on a sequence of characters and returns a formatted sequence of characters, without changing the sequence of characters, other than by formatting them:

Action A :: [character]→[character]

For example, in the case of “word.word”→“word. word”

An action of type B is a function that given a sequence of characters returns a code that represents the state of the system, without changing the sequence of characters:

Action B :: [character]→new state

The new state is any of the possible states that the system might be in, e.g. the shift state to define whether the next character should be capitalised or not, e.g. “Word.”→“shift state of system”.
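A type B action for the shift state might look like the sketch below. The `ShiftState` enum, the set of sentence-ending marks and the choice to capitalise at the start of empty input are all assumptions made for the example.

```python
from enum import Enum


class ShiftState(Enum):
    LOWER = "lower"
    UPPER = "upper"


def shift_state_action(text: str) -> ShiftState:
    """Return the shift state implied by the text, leaving the text unchanged."""
    stripped = text.rstrip(" ")
    # Assumption: capitalise at the very start and after sentence-ending marks.
    if stripped == "" or stripped[-1] in ".?!":
        return ShiftState.UPPER
    return ShiftState.LOWER
```

Unlike a type A action, this function returns only a state code; the character sequence it inspects is never modified.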

S is a recursive sequence of rules, known as “secondary rules”, i.e. S:: [Rule]. When the Rule does not describe any secondary rules, S willbe represented by Ø. The secondary rules will be checked before theactions of the parent rules are applied, allowing an alternativebehaviour for condition C depending on factors described by thesecondary rules. The input for the secondary rules is the sameelectronic character sequence as for the parent rules; however, thefocus of the condition C for the secondary rule is the character in thesequence that precedes the character that triggered the parent rule.

For example, in the preferred embodiment where the electronic charactersequence is passed to the formatting module sequentially, e.g. acharacter at a time, the comparison mechanism compares each parent ruleto the last character received. If a parent rule is found to beapplicable, and that parent rule comprises at least one secondary rule,the comparison mechanism compares the at least one secondary rule to thepenultimate character in the sequence (since the condition C for theparent rule is focused on the final character, whereas for the secondaryrule the focus is on the penultimate character).

The application of secondary rules will be described in more detail below. Since secondary rules are not essential, the general form of the Rule could omit this field.

When designing the formatting module 10, the sequence of actions associated with a rule can be selected by a designer from a predetermined set of candidate actions. The sequence of actions may contain any number of the candidate actions in any order and with any number of repetitions. As stated above, the formatting module 10 allows a system designer to manually extend and adapt the associated actions to the requirements of the languages or the text entry system.

In a preferred embodiment, the formatting module comprises three specialisations of the Rule described above: Context Rules, Category Rules, and Character Rules. The specialised rules provide a powerful tool to capture the way punctuation is used in natural language.

A context rule is a rule of the form: Context Rule :: (C, None, A, Ø). The regular expression present in C is applied only to the context, i.e. the regular expression corresponds to a particular character sequence in the context of the electronic character sequence, for example “www”. Since the state is “None”, a Context Rule will never have or maintain state. Context Rules have no “secondary rules”.

An example of a context rule is a rule for URLs which states that when “www” is in the context, no spaces should be inserted automatically on either side of the punctuator “.”, e.g. “www.site.com”.

Thus, an example of a context rule is:

Context Rule :: (‘www’, None, [DeleteSpaceBefore, DeleteSpaceAfter], Ø).
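A minimal sketch of how such a context rule might be evaluated is given below. It assumes the context is the text preceding the punctuator and uses Python's `re` module for the condition C; the function name and the (left, mark, right) signature are hypothetical.

```python
import re
from typing import Optional

WWW = re.compile(r"www")  # condition C of the example rule


def apply_url_context_rule(left: str, mark: str, right: str) -> Optional[str]:
    """Return the formatted text if the context rule applies, else None."""
    if WWW.search(left) is None:
        return None  # rule not applicable; other rules may be tried
    # [DeleteSpaceBefore, DeleteSpaceAfter]
    return left.rstrip(" ") + mark + right.lstrip(" ")


print(apply_url_context_rule("www ", ".", " site"))  # www.site
```

Returning `None` when the condition fails lets a caller fall through to other rule types.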

A Category Rule preferably takes the form: Category Rule :: (C, None, A, S)

This rule will match the Unicode category of the character in the electronic character sequence to the Unicode category defined by the rule, e.g. the Unicode category of a punctuation mark.

C is a regular expression that is limited to matching the Unicode category of the punctuation mark. Therefore, this type of Rule is only applied to a single character. S is a sequence of secondary rules, e.g. a context rule, a character rule or a category rule. Alternatively this field can be empty, Ø, in the case where no secondary rules are defined.

An example of a category rule is:

Category Rule :: (‘P’, None, [DeleteSpaceBefore, InsertSpaceAfter], Ø), where P corresponds to a category of punctuation marks, e.g. a category that includes ‘!’ and ‘?’ because they should be formatted with the same spaces.

As is known in the art, characters within the Unicode standard have a range of properties associated with them. One of these properties is the category to which a character belongs. The condition C of a category rule can relate to a Unicode category. The General Category value for a character serves as a basic classification of that character, based on its primary usage. The property extends the widely used subdivision of ASCII characters into letters, digits, punctuation, and symbols, a useful classification that needs to be elaborated and further subdivided to remain appropriate for the larger and more comprehensive scope of the Unicode standard.

Each Unicode code point is assigned a normative General Category value. Each value of the General Category is given a two-letter property value alias, where the first letter gives information about a major class and the second letter designates a subclass of that major class. In each class, the subclass “other” merely collects the remaining characters of the major class. For example, the subclass “No” (Number, other) includes all characters of the Number class that are not a decimal digit or letter. These characters may have little in common besides their membership in the same major class.
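For illustration, the General Category of a character can be looked up with Python's standard `unicodedata` module. The helper `in_major_class` is a hypothetical name for one way the condition C of a category rule such as ‘P’ might be tested.

```python
import unicodedata


def general_category(ch: str) -> str:
    """Two-letter General Category alias, e.g. 'Po' or 'Nd'."""
    return unicodedata.category(ch)


def in_major_class(ch: str, major: str) -> bool:
    """True if the first letter of the category equals the major class."""
    return unicodedata.category(ch).startswith(major)


for ch in "!?(5½+":
    print(repr(ch), general_category(ch))
# '!' Po
# '?' Po
# '(' Ps
# '5' Nd
# '½' No
# '+' Sm
```

Here ‘!’ and ‘?’ share the subclass “Po” (Punctuation, other), so a single rule on the major class P covers both.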

A character rule preferably takes the form: Character Rule :: (C, s, A, S). This rule matches a character defined by the rule to a character in the electronic character sequence. Therefore, in this type of rule, C can only contain a single character. C is a regular expression consisting of a single character matched against a character of the electronic character sequence. C may define the Unicode code point for the particular character of interest, this code point being matched to the code point in the electronic character sequence.

s preferably defines two new states, in addition to the None state: {Open, Close, None}. s therefore dictates actions for ambiguous pairs that might be in an Open or Close state, and also includes no state for non-ambiguous characters. By this definition, if for one punctuation mark two different sequences of Actions are required for different states, the system will define two rules. For example, consider formatting the following English sentence correctly:

And he said “Goodbye” and left. It was surprising.

The character rules that define the formatting of spaces in this sentence are as follows:

rule1→Character Rule :: (‘“’, Open, [InsertSpaceBefore, DeleteSpaceAfter], Ø)

rule2→Character Rule :: (‘”’, Close, [DeleteSpaceBefore, InsertSpaceAfter], Ø)

rule3→Character Rule :: (‘.’, None, [DeleteSpaceBefore, InsertSpaceAfter], Ø)
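The three rules above can be sketched as a small lookup keyed on character and state. The sketch makes several assumptions that are not in the description: it uses the straight double quote as the ambiguous character, tracks the Open/Close state with a simple toggle, and processes the whole sentence in one pass. All names are illustrative.

```python
from typing import NamedTuple


class Focus(NamedTuple):
    left: str   # text already formatted
    mark: str   # character under consideration
    right: str  # text still to be processed


def insert_space_before(f):
    if not f.left or f.left.endswith(" "):
        return f
    return f._replace(left=f.left + " ")


def delete_space_before(f):
    return f._replace(left=f.left.rstrip(" "))


def insert_space_after(f):
    # Assumption: no trailing space is added at the end of the text.
    if not f.right or f.right.startswith(" "):
        return f
    return f._replace(right=" " + f.right)


def delete_space_after(f):
    return f._replace(right=f.right.lstrip(" "))


# (character, state) -> sequence of actions, mirroring rule1 to rule3.
RULES = {
    ('"', "Open"): [insert_space_before, delete_space_after],
    ('"', "Close"): [delete_space_before, insert_space_after],
    (".", "None"): [delete_space_before, insert_space_after],
}


def format_spaces(text: str) -> str:
    left, right, quote_open = "", text, False
    while right:
        mark, right = right[0], right[1:]
        state = "None"
        if mark == '"':  # ambiguous character: derive its state, then toggle
            state = "Close" if quote_open else "Open"
            quote_open = not quote_open
        f = Focus(left, mark, right)
        for action in RULES.get((mark, state), []):
            f = action(f)  # actions are applied sequentially
        left, right = f.left + f.mark, f.right
    return left


print(format_spaces('And he said"Goodbye"and left.It was surprising.'))
# And he said "Goodbye" and left. It was surprising.
```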

S is a sequence of secondary rules, which may include any of the three types of rules: Context, Category or Character. It can also define no further rules, in which case this field is denoted by Ø.

An example to explain the interaction of the rules and, in particular, how secondary rules are applied is now provided. In French, a space is placed either side of an exclamation mark “!” or a question mark “?”, e.g. “Bonjour ! Ça va ?”. However, there is an exception to this rule when another exclamation mark precedes the current one, e.g. “Bonjour !!! Ça va ?”. In order for the system to deal with this situation properly, secondary rules can be defined:

rule1→Character Rule :: (‘!’, None, [InsertSpaceBefore, InsertSpaceAfter], [rule2])

rule2→Category Rule :: (‘P’, None, [DeleteSpaceBefore, InsertSpaceAfter], Ø)

In this example, P relates to Category punctuation in accordance with the Unicode standard which occurs prior to the current character of interest, e.g. if the formatting module 10 receives “!” in the sequence “?!”, P is the category that encompasses “?”. In the example above, when the user types the first ‘!’, rule1 will be triggered. Within the trigger routine of rule1 all the secondary rules will be checked but no matches will happen, since there is no punctuation mark preceding the ‘!’, so the default actions for rule1 will be applied by the formatting module 10. On subsequent insertions of the exclamation mark, the step for matching secondary rules will trigger rule2, as ‘!’ is within the Category Punctuation by the Unicode standard, and the actions defined in rule2 will be applied.

If rule2 did not exist, the formatting would be: “Bonjour ! !! Ça va ?”, resulting in an incorrect formatting of the text.
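The interaction of rule1 and its secondary rule2 can be sketched as below. This is an illustration under stated assumptions: the text is processed in one pass rather than keystroke by keystroke, the secondary rule inspects the nearest preceding non-space character (so that a space inserted by an earlier action does not hide the previous ‘!’), and the `Rule` and `Focus` types are hypothetical.

```python
import unicodedata
from typing import Callable, List, NamedTuple


class Focus(NamedTuple):
    left: str
    mark: str
    right: str


def insert_space_before(f):
    if not f.left or f.left.endswith(" "):
        return f
    return f._replace(left=f.left + " ")


def delete_space_before(f):
    return f._replace(left=f.left.rstrip(" "))


def insert_space_after(f):
    if not f.right or f.right.startswith(" "):
        return f
    return f._replace(right=" " + f.right)


class Rule(NamedTuple):
    matches: Callable[[str], bool]  # condition C
    actions: List[Callable]         # A
    secondary: List["Rule"]         # S; [] stands for Ø


def is_punctuation(ch: str) -> bool:
    return unicodedata.category(ch).startswith("P")


rule2 = Rule(is_punctuation, [delete_space_before, insert_space_after], [])
rule1 = Rule(lambda ch: ch == "!",
             [insert_space_before, insert_space_after], [rule2])


def select_actions(rule: Rule, f: Focus):
    """Secondary rules are checked before the parent's own actions."""
    prev = f.left.rstrip(" ")[-1:]  # assumption: look back past spaces
    for sec in rule.secondary:
        if prev and sec.matches(prev):
            return sec.actions
    return rule.actions


def format_spaces(text: str, rules: List[Rule]) -> str:
    left, right = "", text
    while right:
        mark, right = right[0], right[1:]
        f = Focus(left, mark, right)
        for rule in rules:
            if rule.matches(mark):
                for action in select_actions(rule, f):
                    f = action(f)
                break  # stop at the first applicable rule
        left, right = f.left + f.mark, f.right
    return left


print(format_spaces("Bonjour!!! Ça va ?", [rule1]))  # Bonjour !!! Ça va ?
```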

For a given language it is generally required to have multiple rules defined to ensure correct formatting of spaces in an electronic character sequence. The different types of rules, i.e. context, category and character, are preferably applied using a priority scheme, such that the formatting module 10 can succinctly specify the formatting patterns of the spaces for a given language.

A couple of examples which demonstrate why it is preferable to prioritise the application of the types of rules are provided below.

For the specific case of URLs, assume that there are two rules: a context rule which defines that when “www” is in the context, no space should be inserted automatically; and a character rule that says that when the full stop “.” punctuation mark is introduced, a space should be inserted afterwards. In this situation, if the character rule is applied first and the user enters “www.site.com”, the result from the punctuator will be “www. site. com”, because the character rule for the full stop will have preference. To format such a URL correctly, the context rule should have preference over the character rule and should therefore be applied first.

In another example, a formatting module comprises two rules: a first category rule that states that all the characters in a Maths category will have spaces on either side of the Maths character, and a second character rule that defines that the character “−” (minus) will not have any spaces either side of it, because it is most likely to be used as a hyphen. If the user were to insert ‘−’, and the category rule was prioritised over the character rule, the character rule would never be triggered. Thus, to format the sequence correctly, the character rule should be prioritised over the category rule.

The rule that is prioritised is applied, and the comparison mechanism 60 stops the search for applicable rules. However, as described above, a different rule type may be applied as part of a secondary set of rules.

Thus, in the preferred embodiment, as illustrated in FIG. 4, the comparison mechanism is configured to compare, and the formatting module is configured to apply, the rules for an individual language in accordance with the following prioritisation structure:

Context Rules→Character Rules→Category Rules
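The prioritisation structure can be sketched with the two examples discussed above (URLs and the minus sign). The rule representation, a (type, condition, label) tuple, and the function names are assumptions made for this illustration only.

```python
import unicodedata

PRIORITY = {"context": 0, "character": 1, "category": 2}

# Each rule: (rule type, condition on (context, character), description).
# The rules are deliberately listed out of priority order.
RULES = [
    ("category", lambda left, ch: unicodedata.category(ch) == "Sm",
     "spaces both sides"),
    ("character", lambda left, ch: ch == "\u2212", "no spaces"),
    ("character", lambda left, ch: ch == ".", "space after"),
    ("context", lambda left, ch: "www" in left, "no spaces"),
]


def first_applicable(left: str, ch: str):
    """Compare rules in priority order and stop at the first match."""
    for kind, condition, label in sorted(RULES, key=lambda r: PRIORITY[r[0]]):
        if condition(left, ch):
            return kind, label
    return None  # all rules exhausted


print(first_applicable("www", "."))       # ('context', 'no spaces')
print(first_applicable("Hello", "."))     # ('character', 'space after')
print(first_applicable("a", "\u2212"))    # ('character', 'no spaces')
print(first_applicable("a", "+"))         # ('category', 'spaces both sides')
```

Because the search stops at the first match, the minus sign never reaches the Maths category rule, and the full stop in a URL never reaches its character rule.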

To implement the prioritisation structure, the comparison mechanism 60 is preferably configured to identify the type of rule. The comparison mechanism 60 can be configured to identify the rule type by any suitable means. For example, each rule can be labelled with its rule type, where the comparison mechanism 60 is configured to identify all of the rules of a first rule type before comparing those rules to the electronic character sequence to see if one of them is applicable. The rules of a given type can be placed in a container, so that the comparison mechanism 60 compares all rules in a given container, before moving on to the next container. In another embodiment, the comparison mechanism 60 may comprise code to identify the different rule types.

Alternatively, the rules themselves could be ordered according to the prioritisation structure, e.g. listed in accordance with the prioritisation structure.

As will be apparent from the description above, if the comparison mechanism 60 finds that a rule is applicable, it does not continue through the prioritisation structure, e.g. if the context rule is found to be applicable to www.site.com, then the character rule is not compared or applied, since the comparison mechanism 60 has stopped searching for applicable rules. Otherwise this character rule, if applied after the context rule, would result in the incorrect formatting “www. site. com” as described above. However, the rule that is applied may comprise secondary rules of the other rule types, e.g. the formatting module dealing with repeated punctuation where the triggered character rule comprises a secondary category rule:

rule1→Character Rule :: (‘!’, None, [InsertSpaceBefore, InsertSpaceAfter], [rule2])

rule2→Category Rule :: (‘P *’, None, [DeleteSpaceBefore, InsertSpaceAfter], Ø)

Rules may be applicable to a particular language, e.g. English, and the family of languages to which that language belongs, e.g. Latin, or to all languages in the world. There are multiple conventions for punctuation that are common to a number of languages. For example, in all languages URLs are written the same way and therefore they all must have the necessary rules for the correct formatting of these elements.

In the preferred embodiment in which the formatting module 10 supports a plurality of languages, the language identifier 20 is configured to pass the identified language to the comparison mechanism 60, and the comparison mechanism 60 is configured to compare the rules from the set of rules 70 that are relevant given the particular language so identified. The set of rules is preferably ordered into a hierarchical structure, in order to avoid repeating the same rules.

Thus, in addition to the comparison mechanism 60 being configured to compare the rules according to the rule prioritisation structure, e.g. context rule→character rule→category rule, as described above, the comparison mechanism 60 is further configured to compare the rules in a particular order of increasing generality:

language specific rules→language family rules→worldwide rules

To enable the comparison mechanism 60 to compare the rules in this order, the comparison mechanism is preferably configured to identify the language generalisation rule, i.e. whether the rule is a language specific rule, a language family rule or a worldwide rule. The comparison mechanism 60 may be coded to recognise the language generalisation rule or each rule may be labelled to identify the type of language generalisation rule, and containers may be used, as explained above when discussing the rule type prioritisation structure. As stated above, an alternative could be to order the rules into the generalisation structure.

Thus, the comparison mechanism 60 is preferably configured to identify the rule type and the language generalisation rule, e.g. context rule applicable to French (language specific rule).

As can be seen from FIG. 5, the comparison mechanism 60 compares the rules in accordance with the priority system described above, until a rule is found to be applicable: first all the “context rules” will be compared in order of increasing generality of language, e.g. the context rules are checked first for language specific rules, then for family rules, and then for worldwide rules; the comparison mechanism 60 then proceeds to compare the next type of rule, character rules, through increasing generality in language, and then compares the category rules in the same way, until a rule is found to be applicable, at which point the comparison mechanism 60 stops the search for an applicable rule. Alternatively, the comparison mechanism 60 compares all of the rules to find that no rule is applicable and all rules are exhausted.
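The combined search order (rule type first, then increasing generality of language) can be sketched with containers keyed by both labels. The container layout, the rule names and the `is_applicable` callback are hypothetical and only serve to show the order in which rules would be compared.

```python
from itertools import product

RULE_TYPES = ["context", "character", "category"]
SCOPES = ["language", "family", "worldwide"]  # increasing generality


def search_order():
    """All (rule type, scope) containers in the order they are visited."""
    return list(product(RULE_TYPES, SCOPES))


def find_rule(containers, is_applicable):
    """Return the first applicable rule, or None when all are exhausted."""
    for key in product(RULE_TYPES, SCOPES):
        for rule in containers.get(key, []):
            if is_applicable(rule):
                return rule  # stop searching once a rule applies
    return None


containers = {
    ("category", "worldwide"): ["punct-default"],
    ("character", "family"): ["latin-fullstop"],
    ("character", "language"): ["fr-exclamation"],
}

print(search_order()[:4])
# [('context', 'language'), ('context', 'family'),
#  ('context', 'worldwide'), ('character', 'language')]
print(find_rule(containers, lambda r: r != "fr-exclamation"))  # latin-fullstop
```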

Preferably, the comparison mechanism 60 is configured to compare each of the rules to each character in the electronic character sequence in turn. Thus, if the comparison mechanism 60 discovers that a rule is applicable to a character of the character sequence, the formatting module 10 applies this rule to the electronic character sequence to format the spaces of the electronic character sequence, and the comparison mechanism moves on to comparing the rules to the next character in the character sequence. Likewise, if no rule is found to be applicable to that character, the comparison mechanism 60 moves on to comparing the rules to the next character in the electronic character sequence.

As will be understood from above, the language identifier 20 is configured to identify whether the language in which the electronic character sequence is being written is supported and, preferably, which is the most likely supported language. The language identifier 20 may be configured to identify the current language periodically, e.g. for every term (where the electronic character sequence is converted into a sequence of terms or words by a tokeniser) or for example every three terms, in order to identify whether the language has been switched by the user and thus to change the set of rules that are being compared by the comparison mechanism 60 to the electronic character sequence. Any other frequency of checking may be used. If the language identifier 20 determines that the language of the character sequence is not supported by the formatting module 10, the comparison mechanism 60 stops searching for an applicable rule.

A formatting module 10 or system 100 comprising a formatting module 10 in accordance with the present invention provides language detection and rule mechanisms that provide automatic dynamic punctuation. Unlike existing systems which neglect the possibility of having different behaviours for the same punctuation mark depending on the context in which the punctuation mark occurs, the formatting module 10 of the present invention is able to format the spaces either side of a punctuation mark on the basis of the context of the punctuation mark.

The formatting module 10 of the present invention is therefore able to increase the productivity of the user by reducing the interaction required to produce correctly formatted punctuation appropriate to the target language. For multilingual users, the formatting module 10 is preferably able to automatically adjust the space formatting to the language currently being entered. This allows the user to focus on the message being delivered rather than formatting conventions specific to various target languages.

Furthermore, the formatting module 10 of the present invention provides a separate layer that defines the behaviour of the formatting of the spaces for the punctuation, i.e. the rules and their associated actions. This allows independent manual updates of the rules and their associated actions for a particular language, to change the space formatting for that language, without affecting the space formatting for the other languages or requiring an upgrade of the entire formatting module 10.

The present invention also provides a corresponding method for formatting spaces in an electronic character sequence that has preferably been entered by a user. Turning to FIG. 1 and the above described formatting module 10 and system 100 comprising a formatting module 10, the method comprises identifying whether the electronic character sequence is written in a language supported by the formatting module; identifying, with the character identifier 40 (see FIG. 2), a particular character or a particular sequence of characters in the electronic character sequence; and formatting, with the formatting module 10 if a supported language is identified, spaces in the electronic character sequence on the basis of the language identified and the particular character or sequence of characters identified. As will be apparent from the description of the formatting module 10 and the system 100 comprising a formatting module 10, the formatting module preferably supports a plurality of languages, and the most likely supported language can be identified by a language identifier 20 of the formatting module 10 or a language identifier of the prediction engine 30 of the system 100.

Other aspects of the method of the present invention can be readily determined by analogy to the above system description. For example, the formatting module comprises a language identifier to identify whether the electronic character sequence is written in a language supported by the formatting module and to identify the most likely language of the electronic character sequence. The method, by analogy to the formatting module, will also comprise selecting, with a comparison mechanism 60, the set of rules that correspond to the most likely language identified, etc.

The present invention also provides a computer program product comprising a computer readable medium having stored thereon computer program means for causing a processor to carry out the method according to the present invention.

The computer program product may be a data carrier having stored thereon computer program means for causing a processor external to the data carrier, i.e. a processor of an electronic device, to carry out the method according to the present invention. The computer program product may also be available for download, for example from a data carrier or from a supplier over the internet or other available network, e.g. downloaded as an app onto a mobile device (such as a mobile phone) or downloaded onto a computer, the mobile device or computer comprising a processor for executing the computer program means once downloaded.

It will be appreciated that this description is by way of example only; alterations and modifications may be made to the described embodiment without departing from the scope of the invention as defined in the claims.

1. A system, comprising: a processor; memory storing instructions that, when executed by the processor, configure the processor to: identify whether an electronic character sequence is written in a supported language; identify a particular character or a particular sequence of characters in the electronic character sequence; and format spaces in the electronic character sequence on the basis of the language identified and the particular character or sequence of characters identified, when the supported language is identified.
2. The system of claim 1, wherein the instructions that format spaces in the electronic character sequence insert and/or delete spaces in the electronic character sequence.
3. The system of claim 1, wherein the memory stores: at least one set of rules, each rule relating to a particular character or sequence of characters to be identified in the electronic character sequence; wherein each rule is associated with one or more actions which describe the format of spaces to be applied to the electronic character sequence given a supported language and the particular character or sequence of characters; wherein the instructions configure the processor to: compare each rule of one of the at least one set of rules to the electronic character sequence to identify whether a rule is applicable; format spaces in the electronic character sequence by applying the one or more actions associated with the applicable rule to the electronic character sequence.
4. The system of claim 3, wherein the instructions configure the processor to compare each rule of one of the at least one set of rules to the electronic character sequence only when a supported language is identified.
5. The system of claim 1, wherein the system supports a plurality of languages and the instructions further configure the processor to identify the most likely language of the supported languages that the electronic character sequence is written in.
6. The system of claim 1, wherein the instructions configure the processor to identify a punctuation mark and to format the spaces either side of the punctuation mark on the basis of the punctuation mark.
7. The system of claim 1, wherein the instructions configure the processor to identify a particular context in the electronic character sequence, and format the spaces in the electronic character sequence on the basis of the context.
8. The system of claim 1, wherein the instructions configure the processor to identify a punctuation mark in the electronic character sequence and format the spaces either side of the punctuation mark on the basis of the category of punctuation mark.
9. The system of claim 3, wherein the one or more actions comprise a sequence of actions, wherein when a rule is found to be applicable, the instructions configure the processor to apply the sequence of actions to the electronic character sequence.
10. The system of claim 5, wherein the memory comprises a plurality of sets of rules, one set of rules for each language that is supported, and the instructions configure the processor to compare each rule of the set of rules that corresponds to the most likely language to the electronic character sequence.
11. The system of claim 10, wherein the memory comprises sets of rules relating to each language, each family of languages, and all languages in the world and wherein the rules are applied in a hierarchical structure such that, once a supported language has been identified, the instructions configure the processor to first compare each rule from the set of rules specific to that language, followed by each rule from the set of rules applicable to the family of languages to which that language belongs, followed by each rule of the set of rules which are applicable to all languages until an applicable rule is identified or no applicable rule is identified and all rules are exhausted.
12. The system of claim 3, wherein the instructions configure the processor to compare the rules in a specific predetermined order, wherein the set of rules comprises context rules, character rules and category rules and the processor is configured to compare the rules in the following order until an applicable rule is identified or no applicable rule is identified and all rules are exhausted: context rules, character rules, and then category rules.
 13. (canceled)
14. The system of claim 1, wherein the particular character identified in the electronic character sequence is a punctuation mark; and the instructions configure the processor to format spaces in the electronic character sequence on the basis of the language in which the electronic character sequence is written, the punctuation mark identified, and a context of the punctuation mark, when a supported language is identified.
15. The system of claim 1, wherein the instructions configure the processor further to generate a corrected electronic character sequence from the electronic character sequence; and format spaces in the corrected electronic character sequence when a supported language is identified.
16. A system, comprising: a processor; memory storing instructions that, when executed by the processor, configure the processor to: receive an electronic character sequence; identify which language the electronic character sequence is most likely written in; correct the electronic character sequence on the basis of the identified language to generate a corrected electronic character sequence; identify a particular character or a particular sequence of characters in the corrected electronic character sequence; and format spaces in the corrected electronic character sequence on the basis of the language identified and the particular character or the particular sequence of characters identified.
17. A method, comprising: identifying whether an electronic character sequence is written in a supported language; identifying a particular character or a particular sequence of characters in the electronic character sequence; formatting spaces in the electronic character sequence on the basis of the language identified and the particular character or sequence of characters identified, when a supported language is identified.
18. (canceled)
19. The method of claim 17, further comprising identifying the most likely language of the electronic character sequence.
 20. (canceled)
21. The method of claim 17, further comprising: comparing each rule of one of at least one set of rules, each rule defining the formatting of spaces in the electronic character sequence, to the electronic character sequence to identify whether a rule is applicable to the character sequence; identifying that a particular rule is applicable to the character sequence; and applying the applicable rule to the electronic character sequence to format the spaces in the electronic character sequence.
22. (canceled)
 23. (canceled)
24. The method of claim 17, wherein identifying a particular character comprises identifying a punctuation mark and formatting the spaces in the electronic character sequence comprises formatting the spaces either side of the punctuation mark on the basis of the form of the punctuation mark.
25. The method of claim 17, wherein identifying a particular sequence of characters comprises identifying a particular context and formatting the spaces in the electronic character sequence comprises formatting the spaces on the basis of the context.
26. The method of claim 17, wherein identifying a particular character comprises identifying a punctuation mark and formatting the spaces in the electronic character sequence comprises formatting the spaces either side of the punctuation mark on the basis of the category of punctuation mark.
 27. (canceled)
28. (canceled)
29. (canceled)
30. The method of claim 17, wherein the rules are compared in a specific predetermined order, wherein the set of rules comprises context rules, character rules and category rules, and the method comprises comparing the rules in the following order until an applicable rule is identified or no applicable rule is identified and all rules are exhausted: context rules, character rules, and then category rules.
31. (canceled)
 32. (canceled)
33. A non-transitory computer readable medium containing program instructions which, when executed by a processor, configure the processor to: identify whether an electronic character sequence is written in a supported language; identify a particular character or a particular sequence of characters in the electronic character sequence; and format spaces in the electronic character sequence on the basis of the language identified and the particular character or sequence of characters identified, when a supported language is identified.